This document demonstrates basic data analysis and plotting using the HAVEN datasets in R. We will firstly plot the data using a variety of plots to understand how certain graphs are preferred for certain data. We will then perform basic statistical tests to understand whether the relationship between the variables are significant.
First thing first, we need to load and read the data. The first
dataset is in the participant_data.csv file. This file
format is called ‘comma separated values’ or CSV, which we can read into
R using the read_csv function. We firstly assign the data
to a variable called data, which we can now work directly
with. You can see this on the right hand panel called Data
in R ->
Just to see if we have loaded in the correct file and correctly
assigned it, we can display the first few rows of data
using the head function:
head(data)
## # A tibble: 6 × 9
## BIDS_ID age sex height weight bmi lesion_volume lesion_number education
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sub-0001 54 M 1.81 105 32.2 1.31 11 6
## 2 sub-0002 56 M 1.93 118 31.7 0.184 5 7
## 3 sub-0003 54 F 1.64 60 22.4 0.476 5 3
## 4 sub-0004 55 F 1.74 71 23.6 2.14 12 4
## 5 sub-0005 59 M 1.7 82.4 28.5 0.234 4 6
## 6 sub-0006 58 F 1.65 69 25.3 0.119 4 7
We can also calculate summary statistics for each variable using
summary. The summary statistics will give us the mean,
range etc for each.
summary(data)
## BIDS_ID age sex height
## Length:52 Min. :52.00 Length:52 Min. :1.450
## Class :character 1st Qu.:57.50 Class :character 1st Qu.:1.629
## Mode :character Median :63.00 Mode :character Median :1.708
## Mean :63.60 Mean :1.709
## 3rd Qu.:69.25 3rd Qu.:1.791
## Max. :81.00 Max. :1.930
## weight bmi lesion_volume lesion_number
## Min. : 49.00 Min. :18.10 Min. : 0.1190 Min. : 4.000
## 1st Qu.: 65.60 1st Qu.:22.38 1st Qu.: 0.5488 1st Qu.: 8.000
## Median : 71.75 Median :25.15 Median : 0.8456 Median : 9.500
## Mean : 74.58 Mean :25.49 Mean : 1.5247 Mean : 9.885
## 3rd Qu.: 85.00 3rd Qu.:27.99 3rd Qu.: 1.2872 3rd Qu.:11.250
## Max. :118.00 Max. :34.90 Max. :15.8031 Max. :25.000
## education
## Min. :2.000
## 1st Qu.:3.000
## Median :6.000
## Mean :5.231
## 3rd Qu.:7.000
## Max. :8.000
We can also visualize the data using various plots. For example, if we wanted to understand whether people who are older have a higher lesion volume (possibly as a result of aging) we can create a scatter plot of age and total lesion volume, and add a line of best fit.
ggplot(data, aes(x = age, y = `lesion_volume`)) +
geom_point(color = "darkblue") +
geom_smooth(method = "lm", se = TRUE, color = "darkred", linetype = "solid") +
labs(title = "Scatter plot of age and Total lesion volume",
x = "age",
y = "lesion_volume") +
theme_minimal()
So, you can see a positive correlation, namely that (generally) the older you are, the higher the total lesion volume will be.
We can also group these results by sex, if for example we wanted to know whether there is a gender effect. We can do this using a boxplot, specficially a violin plot, a type of boxplot but which looks a bit nicer.
ggplot(data, aes(x = sex, y = `lesion_volume`)) +
geom_violin(aes(fill = sex), scale = "area", trim = TRUE) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Violin plot of Total lesion volume by sex",
x = "sex",
y = "lesion_volume") +
theme_minimal() +
theme(legend.position = "none")
There seems to be a similar relationship between both genders, but more of a spread among Males, possibly due to the single outlier at the top. You can see this outlier in the scatter plot above as well.
If we wanted to understand whether there is a difference between the height and weight between Males and Females, we can do this using a scatter plot, grouped by Sex:
g <- ggplot(data, aes(x=height, y=weight, colour=sex))
g <- g + geom_point()
g
So you can see that males are generally towards the top and right (i.e., taller and heavier).
Another type of graph that we can use is a histogram. Histograms are useful if we want to understand the spread of a particular variable for different categories. For example, if we wanted to understand the distribution of bmi or age across the group.
Firstly, BMI:
ggplot(data, aes(x = bmi)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "steelblue", color = "black") +
geom_density(color = "darkred", linetype = "solid", linewidth = 1) +
labs(title = "Histogram of bmi with Density Curve",
x = "bmi",
y = "Density") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"))
So you can see that the most common BMI is 25. You can see that the distribution of BMI is lower for both low and high bmi (under and overweight) and higher in the middle. This variable is said to be ‘normally distributed’. More examples of normally distributed data include height and weight (which makes sense as they constitute BMI!).
Height:
ggplot(data, aes(x = height)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.01, fill = "steelblue", color = "black") +
geom_density(color = "darkred", linetype = "solid", linewidth = 1) +
labs(title = "Histogram of height with Density Curve",
x = "height",
y = "Density") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"))
Weight:
ggplot(data, aes(x = weight)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "steelblue", color = "black") +
geom_density(color = "darkred", linetype = "solid", linewidth = 1) +
labs(title = "Histogram of weight with Density Curve",
x = "weight",
y = "Density") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"))
Let’s do the same but for age:
ggplot(data, aes(x = age)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "steelblue", color = "black") +
geom_density(color = "darkred", linetype = "solid", linewidth = 1) +
labs(title = "Histogram of age with Density Curve",
x = "age",
y = "Density") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"))
Conversely to BMI, height and weight we see that age is not normally distributed, but this makes because we only recruited those over 50. If we were recruiting across all age ranges from the general population (i.e., not just from the university where most would be students between 18-25), then we probably would see a normal distribution.
Ok, so we have been able to visualise the data, and we have an idea of some of the trends. But how can we test whether these results are significant? Can we statistically say that Men are taller than Women? What about the relationship between Lesion Volume and age? Can we say that the older you are the higher lesion volume you have?
Firstly, lets get some more summary statistics, this time for height and weight specficially:
group_by(data, sex) %>%
summarise(
count = n(),
mean = mean(height, na.rm = TRUE),
sd = sd(height, na.rm = TRUE)
)
## # A tibble: 2 × 4
## sex count mean sd
## <chr> <int> <dbl> <dbl>
## 1 F 29 1.64 0.0726
## 2 M 23 1.80 0.0739
group_by(data, sex) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
## # A tibble: 2 × 4
## sex count mean sd
## <chr> <int> <dbl> <dbl>
## 1 F 29 65.9 11.0
## 2 M 23 85.5 12.5
You can see that the mean height for women is 1.64 and men is 1.80 (meters), whilst the mean weight is 65.9 and 85.5 kg respectively.
But is this difference significant? To find out the statistical difference between two groups we can simply run a significance test (in this case a t-test).
women<-data[which(data$sex=="F"),]
men<-data[which(data$sex=="M"),]
t.test(women$height, men$height)
##
## Welch Two Sample t-test
##
## data: women$height and men$height
## t = -7.6125, df = 46.946, p-value = 9.77e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1970430 -0.1146661
## sample estimates:
## mean of x mean of y
## 1.640276 1.796130
t.test(women$weight, men$weight)
##
## Welch Two Sample t-test
##
## data: women$weight and men$weight
## t = -5.8948, df = 44.265, p-value = 4.723e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -26.25710 -12.87904
## sample estimates:
## mean of x mean of y
## 65.92759 85.49565
So, we can see in the p-values that there is a statistically significant difference between the height and weight between males and females. (9.77e-10 is a 0 followed by 10 more zeroes so 0.000000000977). What value is used as a cut-off for significance? Why?
The answer is that a p-value of 0.05 is used as a cut-off. A p-value below 0.05 is said to be significant, and above is said to be insignificant. You don’t need to know why 0.05 specifically or what it means, it just helps to know that there is some value that people measured against.
In the case that we want to see whether there is a significant relationship between two continuous variables, i.e., between age and Lesion Volume across the entire group, we can run a correlation. A correlation just checks whether or not as one measure increases, the other also increases and vice versa. For example, we can run a correlation between lesion volume and age to see if as lesion volume increases age also increases:
# Calculate linear regression
regression <- lm(`lesion_volume` ~ age, data = data)
r_value <- cor(data$age, data$`lesion_volume`)
p_value <- summary(regression)$coefficients[2, "Pr(>|t|)"]
# Create plot
ggplot(data, aes(x = age, y = `lesion_volume`)) +
geom_point(color = "darkblue") +
geom_smooth(method = "lm", se = TRUE, color = "darkred", linetype = "solid") +
labs(title = "Scatter plot of age and Total lesion volume",
x = "age",
y = "lesion_volume") +
theme_minimal() +
annotate("text", x = min(data$age), y = max(data$`lesion_volume`),
label = paste("R = ", round(r_value, 2), "\np-value = ", format.pval(p_value, digits = 2), sep = ""),
hjust = 0, vjust = 1)
So, whilst there is a significant relationship between age and Lesion Volume, we can see one or two outliers in our data. What would the data look like if we removed them?
# Remove 21st and 30th rows
data_removed <- data[-c(21, 30),]
# Perform linear regression
model <- lm(`lesion_volume` ~ age, data = data_removed)
p_value <- summary(model)$coefficients[2, "Pr(>|t|)"]
# Create a scatter plot
ggplot(data_removed, aes(x = age, y = `lesion_volume`)) +
geom_point(color = "darkblue") +
geom_smooth(method = "lm", se = TRUE, color = "darkred", linetype = "solid") +
labs(title = "Scatter plot of age and Total lesion volume",
x = "age",
y = "lesion_volume") +
scale_y_continuous(limits = c(0, 5)) +
theme_minimal() +
annotate("text", x = min(data_removed$age), y = 5,
label = paste("p-value = ", format.pval(p_value, digits = 2), sep = ""),
hjust = 0, vjust = 1)
So, the relationship is still significant, but less so than before.
Finally, we may also want to see if there is a difference between Males and Females with Lesion Volume and age. For example, maybe women or men are protected against lesions as they age, whilst they other are not. In this case we would expect to see a significant relationship for one group but not the other. Let’s run a correlation for males and females with lesion volume separately, and plot them on the same graph:
# Perform linear regression for each sex
model_male <- lm(`lesion_volume` ~ age, data = data_removed[data_removed$sex == "M",])
model_female <- lm(`lesion_volume` ~ age, data = data_removed[data_removed$sex == "F",])
# Extract p-values
p_value_male <- summary(model_male)$coefficients[2, "Pr(>|t|)"]
p_value_female <- summary(model_female)$coefficients[2, "Pr(>|t|)"]
# Create a scatter plot separating Males and Females
ggplot(data_removed, aes(x = age, y = `lesion_volume`, color = sex)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, linetype = "solid") +
labs(title = "Scatter plot of age and Total lesion volume by sex",
x = "age",
y = "lesion_volume") +
scale_y_continuous(limits = c(0, 5)) +
theme_minimal() +
annotate("text", x = min(data_removed$age), y = 5,
label = paste("Males: p-value = ", format.pval(p_value_male, digits = 2), "\nFemales: p-value = ", format.pval(p_value_female, digits = 2), sep = ""),
hjust = 0, vjust = 1)
We can now see that there is only a significant difference in the relationship between age and total lesion volume for females, and not for males. Why do you think that this is? Regardless, by looking at the data in this way we have found something interesting (which we didn’t see before) that we can subsequently look at further.
Finally, we may also be interested with understanding the potential
impact of demographic and brain factors upon behaviour. To do this we
can look at participants’ reaction times and accuracy during the
episodic memory task using the rt_data dataset.
Again, let’s run head to see the data:
head(rt_data)
## # A tibble: 6 × 11
## Age Sex Height Weight BMI lesion_volume lesion_number correct_percent
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 F 1.62 63 23.9 0.522 8 81.8
## 2 62 M 1.84 85 25.1 1.25 13 95.9
## 3 52 M 1.85 89 26 0.294 6 94
## 4 54 F 1.64 60 22.4 0.476 5 88.9
## 5 69 M 1.89 90 25.1 0.826 9 89
## 6 68 M 1.68 75 26.5 2.51 14 90
## # ℹ 3 more variables: incorrect_percent <dbl>, correct_rt <dbl>,
## # incorrect_rt <dbl>
And summary:
summary(rt_data)
## Age Sex Height Weight
## Min. :52.00 Length:48 Min. :1.450 Min. : 49.00
## 1st Qu.:57.50 Class :character 1st Qu.:1.629 1st Qu.: 64.30
## Median :63.00 Mode :character Median :1.708 Median : 71.75
## Mean :63.46 Mean :1.713 Mean : 74.63
## 3rd Qu.:69.25 3rd Qu.:1.796 3rd Qu.: 85.25
## Max. :78.00 Max. :1.930 Max. :118.00
## BMI lesion_volume lesion_number correct_percent
## Min. :18.10 Min. :0.1842 Min. : 4.000 Min. :77.78
## 1st Qu.:22.30 1st Qu.:0.5488 1st Qu.: 8.000 1st Qu.:89.97
## Median :25.15 Median :0.8456 Median : 9.500 Median :93.94
## Mean :25.37 Mean :1.2798 Mean : 9.646 Mean :92.04
## 3rd Qu.:27.99 3rd Qu.:1.2576 3rd Qu.:11.000 3rd Qu.:95.96
## Max. :34.30 Max. :9.0293 Max. :18.000 Max. :98.99
## incorrect_percent correct_rt incorrect_rt
## Min. : 1.010 Min. :1.450 Min. :1.632
## 1st Qu.: 4.040 1st Qu.:1.952 1st Qu.:2.354
## Median : 6.060 Median :2.110 Median :2.709
## Mean : 7.963 Mean :2.159 Mean :2.719
## 3rd Qu.:10.025 3rd Qu.:2.337 3rd Qu.:3.070
## Max. :22.220 Max. :3.076 Max. :4.332
At a first glance, there are some clear questions that we can investigate using our new-found plotting skills:
Firstly, lets plot the relationship between age and performance on the task, both for the reaction time (only for ones they got correct) and the percentage accuracy:
# Create a theme for scientific publication
theme_publication <- theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
legend.position = "none"
)
# Plot Age against correct_rt
rt_model <- lm(correct_rt ~ Age, data = rt_data)
rt_p_value <- summary(rt_model)$coefficients[2, "Pr(>|t|)"]
rt_plot <- ggplot(rt_data, aes(x = Age, y = correct_rt)) +
geom_point(color = "darkblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "solid") +
labs(title = "Reaction Time vs. Age",
x = "Age",
y = "Mean Correct Reaction Time (ms)") +
theme_publication +
annotate("text", x = min(rt_data$Age), y = max(rt_data$correct_rt),
label = paste("p-value =", format.pval(rt_p_value, digits = 3)),
hjust = 0, vjust = 1, size = 4)
# Plot Age against correct_percent
percent_model <- lm(correct_percent ~ Age, data = rt_data)
percent_p_value <- summary(percent_model)$coefficients[2, "Pr(>|t|)"]
percent_plot <- ggplot(rt_data, aes(x = Age, y = correct_percent)) +
geom_point(color = "darkblue") +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "solid") +
labs(title = "Correct Percentage vs. Age",
x = "Age",
y = "Correct Percentage (%)") +
theme_publication +
annotate("text", x = min(rt_data$Age), y = max(rt_data$correct_percent),
label = paste("p-value =", format.pval(percent_p_value, digits = 3)),
hjust = 0, vjust = 1, size = 4)
# Display the plots
print(rt_plot)
print(percent_plot)
So we can see that older participants took longer (on correct trials), and made more mistakes. Both of these are significant.
Let’s now finally look at each of these but separated by sex:
# Create a theme for scientific publication
theme_publication <- theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
legend.title = element_blank(),
legend.text = element_text(size = 10),
legend.position = "bottom"
)
# Plot Age against correct_rt separated by Sex
rt_model_m <- lm(correct_rt ~ Age, data = rt_data[rt_data$Sex == "M", ])
rt_model_f <- lm(correct_rt ~ Age, data = rt_data[rt_data$Sex == "F", ])
rt_p_value_m <- summary(rt_model_m)$coefficients[2, "Pr(>|t|)"]
rt_p_value_f <- summary(rt_model_f)$coefficients[2, "Pr(>|t|)"]
rt_plot <- ggplot(rt_data, aes(x = Age, y = correct_rt, color = Sex)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, linetype = "solid") +
labs(title = "Reaction Time vs. Age by Sex",
x = "Age",
y = "Correct Reaction Time (ms)") +
scale_color_manual(values = c("darkblue", "darkred"),
labels = c("Female", "Male")) +
theme_publication +
annotate("text", x = min(rt_data$Age), y = max(rt_data$correct_rt),
label = paste("Males: p-value =", format.pval(rt_p_value_m, digits = 3),
"\nFemales: p-value =", format.pval(rt_p_value_f, digits = 3)),
hjust = 0, vjust = 1, size = 4) +
guides(color = guide_legend(title = "Sex"))
# Plot Age against correct_percent separated by Sex
percent_model_m <- lm(correct_percent ~ Age, data = rt_data[rt_data$Sex == "M", ])
percent_model_f <- lm(correct_percent ~ Age, data = rt_data[rt_data$Sex == "F", ])
percent_p_value_m <- summary(percent_model_m)$coefficients[2, "Pr(>|t|)"]
percent_p_value_f <- summary(percent_model_f)$coefficients[2, "Pr(>|t|)"]
percent_plot <- ggplot(rt_data, aes(x = Age, y = correct_percent, color = Sex)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, linetype = "solid") +
labs(title = "Correct Percentage vs. Age by Sex",
x = "Age",
y = "Correct Percentage (%)") +
scale_color_manual(values = c("darkblue", "darkred"),
labels = c("Female", "Male")) +
theme_publication +
annotate("text", x = min(rt_data$Age), y = min(rt_data$correct_percent),
label = paste("Males: p-value =", format.pval(percent_p_value_m, digits = 3),
"\nFemales: p-value =", format.pval(percent_p_value_f, digits = 3)),
hjust = 0, vjust = 0, size = 4) +
guides(color = guide_legend(title = "Sex"))
# Display the plots
print(rt_plot)
print(percent_plot)
So here we can see that there is a significant relationship between female age and reaction time (but not for males), but neither sex with correct percentage, although there are trends (i.e., the p-value is close to 0.05).
Finally, we would like to see if there is an effect of lesions on the reaction time or percent accuracy. We can firstly plot this for the entire group:
# Create a theme for scientific publication
theme_publication <- theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
legend.position = "none"
)
# Plot lesion_number against correct_percent
model_ln_cp <- lm(correct_percent ~ lesion_number, data = rt_data)
p_value_ln_cp <- summary(model_ln_cp)$coefficients[2, "Pr(>|t|)"]
plot_ln_cp <- ggplot(rt_data, aes(x = lesion_number, y = correct_percent)) +
geom_point(color = "darkblue", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "solid") +
labs(title = "Lesion Number vs. Correct Percentage",
x = "Lesion Number",
y = "Correct Percentage (%)") +
annotate("text", x = max(rt_data$lesion_number), y = max(rt_data$correct_percent),
label = paste("p-value =", format.pval(p_value_ln_cp, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Plot lesion_number against correct_rt
model_ln_rt <- lm(correct_rt ~ lesion_number, data = rt_data)
p_value_ln_rt <- summary(model_ln_rt)$coefficients[2, "Pr(>|t|)"]
plot_ln_rt <- ggplot(rt_data, aes(x = lesion_number, y = correct_rt)) +
geom_point(color = "darkred", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "solid") +
labs(title = "Lesion Number vs. Correct Reaction Time",
x = "Lesion Number",
y = "Correct Reaction Time (ms)") +
annotate("text", x = max(rt_data$lesion_number), y = max(rt_data$correct_rt),
label = paste("p-value =", format.pval(p_value_ln_rt, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Plot lesion_volume against correct_percent
model_lv_cp <- lm(correct_percent ~ lesion_volume, data = rt_data)
p_value_lv_cp <- summary(model_lv_cp)$coefficients[2, "Pr(>|t|)"]
plot_lv_cp <- ggplot(rt_data, aes(x = lesion_volume, y = correct_percent)) +
geom_point(color = "darkblue", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "solid") +
geom_vline(xintercept = 3, color = "red", linetype = "dashed") + # Add this line
labs(title = "Lesion Volume vs. Correct Percentage",
x = "Lesion Volume",
y = "Correct Percentage (%)") +
annotate("text", x = max(rt_data$lesion_volume), y = max(rt_data$correct_percent),
label = paste("p-value =", format.pval(p_value_lv_cp, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Plot lesion_volume against correct_rt
model_lv_rt <- lm(correct_rt ~ lesion_volume, data = rt_data)
p_value_lv_rt <- summary(model_lv_rt)$coefficients[2, "Pr(>|t|)"]
plot_lv_rt <- ggplot(rt_data, aes(x = lesion_volume, y = correct_rt)) +
geom_point(color = "darkred", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "solid") +
geom_vline(xintercept = 3, color = "red", linetype = "dashed") + # Add this line
labs(title = "Lesion Volume vs. Correct Reaction Time",
x = "Lesion Volume",
y = "Correct Reaction Time (ms)") +
annotate("text", x = max(rt_data$lesion_volume), y = max(rt_data$correct_rt),
label = paste("p-value =", format.pval(p_value_lv_rt, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Display the plots
print(plot_ln_cp)
print(plot_ln_rt)
print(plot_lv_cp)
print(plot_lv_rt)
But you can see that there are outliers for the last two graphs, indicated by those beyond the red line. What would happen if we removed these from the plots?
# Create a theme for scientific publication
theme_publication <- theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
legend.position = "none"
)
# Subset the data to exclude lesion_volume greater than 3
rt_data_subset <- subset(rt_data, lesion_volume <= 3)
# Plot lesion_number against correct_percent
model_ln_cp <- lm(correct_percent ~ lesion_number, data = rt_data_subset)
p_value_ln_cp <- summary(model_ln_cp)$coefficients[2, "Pr(>|t|)"]
plot_ln_cp <- ggplot(rt_data_subset, aes(x = lesion_number, y = correct_percent)) +
geom_point(color = "darkblue", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "solid") +
labs(title = "Lesion Number vs. Correct Percentage",
x = "Lesion Number",
y = "Correct Percentage (%)") +
annotate("text", x = max(rt_data_subset$lesion_number), y = max(rt_data_subset$correct_percent),
label = paste("p-value =", format.pval(p_value_ln_cp, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Plot lesion_number against correct_rt
model_ln_rt <- lm(correct_rt ~ lesion_number, data = rt_data_subset)
p_value_ln_rt <- summary(model_ln_rt)$coefficients[2, "Pr(>|t|)"]
plot_ln_rt <- ggplot(rt_data_subset, aes(x = lesion_number, y = correct_rt)) +
geom_point(color = "darkred", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "solid") +
labs(title = "Lesion Number vs. Correct Reaction Time",
x = "Lesion Number",
y = "Correct Reaction Time (ms)") +
annotate("text", x = max(rt_data_subset$lesion_number), y = max(rt_data_subset$correct_rt),
label = paste("p-value =", format.pval(p_value_ln_rt, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Plot lesion_volume against correct_percent
model_lv_cp <- lm(correct_percent ~ lesion_volume, data = rt_data_subset)
p_value_lv_cp <- summary(model_lv_cp)$coefficients[2, "Pr(>|t|)"]
plot_lv_cp <- ggplot(rt_data_subset, aes(x = lesion_volume, y = correct_percent)) +
geom_point(color = "darkblue", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkblue", linetype = "solid") +
labs(title = "Lesion Volume vs. Correct Percentage",
x = "Lesion Volume",
y = "Correct Percentage (%)") +
annotate("text", x = max(rt_data_subset$lesion_volume), y = max(rt_data_subset$correct_percent),
label = paste("p-value =", format.pval(p_value_lv_cp, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Plot lesion_volume against correct_rt
model_lv_rt <- lm(correct_rt ~ lesion_volume, data = rt_data_subset)
p_value_lv_rt <- summary(model_lv_rt)$coefficients[2, "Pr(>|t|)"]
plot_lv_rt <- ggplot(rt_data_subset, aes(x = lesion_volume, y = correct_rt)) +
geom_point(color = "darkred", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "solid") +
labs(title = "Lesion Volume vs. Correct Reaction Time",
x = "Lesion Volume",
y = "Correct Reaction Time (ms)") +
annotate("text", x = max(rt_data_subset$lesion_volume), y = max(rt_data_subset$correct_rt),
label = paste("p-value =", format.pval(p_value_lv_rt, digits = 3)),
hjust = 1, vjust = 1, size = 4) +
theme_publication
# Display the plots
print(plot_ln_cp)
print(plot_ln_rt)
print(plot_lv_cp)
print(plot_lv_rt)
So you can see a difference (particularly in the last graph) although none of the relationships become significant by removing the outliers.
Load in the data.
combined_data <- read_csv("../data/correct_rt_data.csv")
The code itself is not complicated at all (it’s only 8 lines), but we need to be specific in how we organise the data.
Specifically, we need to do the following: - Plot Reaction Time on the y-axis and Age on the x-axis - We group the data by ‘ID’ so that each participant will have a different colour - We calculate the mean Reaction Time for each ID, and add error bars - We add a line of best fit to the data
And that’s it!
# Create the scatter plot with mean and standard deviation error bars and a trend line
ggplot(combined_data, aes(x = age, y = RT, color = ID)) +
geom_point(alpha = 0.5) +
stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1), geom = "errorbar", width = 0.4, color = "black") +
stat_summary(fun = "mean", geom = "point", shape = 21, fill = "white", size = 3) +
geom_smooth(method = "lm", se = TRUE, color = "black", fill = "gray", alpha = 0.2) +
labs(x = "Age", y = "Reaction Time", color = "ID") +
theme_classic() +
guides(color = FALSE)
Here is a slightly different version:
And if we want, we can combine the different graphs into a single image, which is common when writing scientific papers:
library(gridExtra)
library(dplyr)
library(grid)
# Filter out lesion volumes beyond 3
rt_data_filtered <- rt_data %>% filter(lesion_volume <= 3)
# Create a theme for publication
theme_publication <- theme_minimal() +
theme(
text = element_text(family = "Arial", size = 10),
axis.title = element_text(face = "bold", size = 10),
axis.text = element_text(size = 8),
legend.position = "none",
legend.title = element_text(size = 9, face = "bold"),
legend.text = element_text(size = 8),
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_blank(),
panel.border = element_rect(color = "black", fill = NA, size = 0.5),
plot.title = element_text(face = "bold", size = 12, hjust = -0.05, vjust = 1.5)
)
# Plot A: Violin plot of Age by Sex with individual data points
plot_A <- ggplot(rt_data_filtered, aes(x = Sex, y = Age, fill = Sex)) +
geom_violin(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.5, aes(color = Sex)) +
stat_summary(fun = mean, geom = "point", size = 3, color = "black") +
scale_fill_manual(values = c("blue", "red")) +
scale_color_manual(values = c("blue", "red")) +
labs(x = "Sex", y = "Age", title = "A") +
theme_publication
# Plot B: BMI vs correct_percent
plot_B <- ggplot(rt_data_filtered, aes(x = BMI, y = correct_percent)) +
geom_point(alpha = 0.5, color = "purple") +
geom_smooth(method = "lm", se = TRUE, color = "purple", fill = "lavender") +
labs(x = "BMI", y = "Correct Percentage", title = "B") +
theme_publication
# Plot C1: Scatter plot with trend lines (lesion number vs correct RT)
plot_C1 <- ggplot(rt_data_filtered, aes(x = lesion_number, y = correct_rt)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
labs(x = "Lesion Number", y = "Correct RT", title = "C") +
theme_publication
# Plot C2: Scatter plot with trend lines (lesion volume vs correct RT)
plot_C2 <- ggplot(rt_data_filtered, aes(x = lesion_volume, y = correct_rt)) +
geom_point(alpha = 0.5, color = "red") +
geom_smooth(method = "lm", se = TRUE, color = "red", fill = "pink") +
labs(x = "Lesion Volume", y = "Correct RT") +
theme_publication
# Plot D: Age against correct_percent separated by Sex
plot_D <- ggplot(rt_data_filtered, aes(x = Age, y = correct_percent, color = Sex)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
scale_color_manual(values = c("blue", "red")) +
labs(x = "Age", y = "Correct Percentage", title = "D") +
theme_publication +
theme(legend.position = c(0.85, 0.85),
legend.background = element_rect(fill = "white", color = "black"),
legend.title = element_text(size = 8),
legend.text = element_text(size = 7),
legend.key.size = unit(0.5, "lines"))
# Combine plots
final_plot <- grid.arrange(
plot_A, plot_C1, plot_C2,
plot_B, plot_D,
layout_matrix = rbind(c(1,2,3), c(4,5,5)),
widths = c(1, 1, 1),
heights = c(1, 1)
)